2019 Stack Overflow Developer Survey Results

Introduction

This data analysis project uses the 2019 Stack Overflow Developer Survey Results dataset, which is publicly available from Stack Overflow.

The survey had a total of 88,883 respondents and 85 questions, ranging from general topics such as age, gender, and academic background to professional ones such as the technologies worked with, employment status, and more. The survey also included some questions about the respondents' interactions with Stack Overflow, but those fall outside the scope of this analysis.

The dataset has not been modified beyond what is done in this Notebook. The original data is already clean and well structured, so most of the modifications here simply make the data easier to visualize.

The analysis uses the pandas and NumPy libraries to load and modify the data. For the visualization, I chose Plotly (Express) for two reasons:

  • I am already familiar with Matplotlib, so this is a chance to learn a new plotting library;
  • The interactivity provided by Plotly charts.

Almost all the plots are created with Plotly Express, a high-level API in the Plotly library. With that said, the table plot used in the last part of the analysis is created through the more traditional Graph Objects API.

Structure

The analysis has four main sections:

  • General Data: includes general data about the respondents, including the gender distribution, academic background, employment status and country;
  • Tech Stack: all the technology-related data, including programming languages, database environments, operating systems, etc.;
  • Professional Life: data related to the professional lives of the respondents, such as work week hours, work location, most important job factors, etc.
  • Other: explores questions that combine data from the previous sections, like "Programming Languages vs Annual Compensation" or "Work Location vs Work Week Hours".

The Conclusions section at the very end aggregates all the "Key Takeaways" extracted from each of the enumerated sections. It contains nothing new, only a TL;DR with the key takeaways of this data analysis.

Objectives

The following are the guiding questions and/or objectives of this analysis. They are answered throughout the four outlined sections, in the "Key Takeaways" at the end of each section:

  • What are the most used technologies (programming languages, database environments, operating systems, etc.)?
  • Where are respondents from, and what are their work conditions and employment status?
  • How does remote work factor into today's professional life (compensation and work week hours)?
  • How does the tech stack relate to annual compensation?

Data Analysis

Imports

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from collections import Counter
from typing import List, Optional

Global Variables

In [2]:
# Color for the General section charts
general_color = "#60BD68"
# Gradient for the General section charts
general_color_gradient = ["#30ff00", "#29eb00", "#21d100", "#18b700", "#11a400"]

# Color for the Tech Stack section charts
tech_stack_color = "#5DA5DA"
# Gradient for the Tech Stack section charts
tech_stack_color_gradient = ["#009fff", "#008fe3", "#0084d1", "#0075b8", "#0069a4"]

# Color for the People section charts
people_color = "#DECF3F"
# Gradient for the People section charts
people_color_gradient = ["#f5ff00", "#dbe100", "#c1c200", "#b3b200", "#a4a000"]

# Font size for all charts
chart_font_size = 14

Helper Functions

These functions facilitate common operations throughout the analysis. They are broken down into three categories for organization in the Notebook: general operations, data transformations, and plotting.

General operations

In [3]:
def count_options_frequencies(replies: pd.Series) -> Counter:
    """
    Given a Series of data entries of semicolon-separated options
    (i.e., replies for multiple choice questions), count the frequencies
    of each unique option. 
    The function also works for single choice questions as the split
    doesn't affect the output.
    """
    # Create a dict-like object to map the options to their frequencies
    options_freqs = Counter()
    
    # Split each data value at semicolons and, with the resulting individual\
    # options, update their frequencies accordingly
    for reply in replies:
        options_freqs.update(reply.split(";"))
    
    return options_freqs
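As a quick sanity check, here is how the counter behaves on a few hypothetical semicolon-separated replies (the function is reproduced so the snippet is self-contained):

```python
import pandas as pd
from collections import Counter

def count_options_frequencies(replies: pd.Series) -> Counter:
    # Count each semicolon-separated option across all replies
    options_freqs = Counter()
    for reply in replies:
        options_freqs.update(reply.split(";"))
    return options_freqs

# Hypothetical replies to a multiple choice question
replies = pd.Series(["Python;SQL", "Python", "SQL;JavaScript"])
freqs = count_options_frequencies(replies)
print(freqs)  # Counter({'Python': 2, 'SQL': 2, 'JavaScript': 1})
```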

Data transformations

In [4]:
def prepare_dataframe(
    options_counts: Counter, 
    headers: list,
    ascend_sort: bool = False
    ) -> pd.DataFrame:
    """
    Given a Counter with the frequencies of each reply option to a
    single question, create a new dataframe based on that data.
    """
    # Get the headers for each column
    options_header = headers[0]
    freqs_header = headers[1]
    
    # Get the data for each column
    # The options are the keys from the Counter
    options = list(options_counts)
    # The frequencies are the values from the Counter
    frequencies = [options_counts[option] for option in options_counts]

    # Store the options and respective frequencies in a new DataFrame
    new_df = pd\
        .DataFrame({
            options_header: options,
            freqs_header: frequencies
        })

    # Sort the dataframe by frequency (descending by default)
    new_df.sort_values(
        by=freqs_header, 
        ascending=ascend_sort, 
        inplace=True
    )
    
    # Reset the DataFrame index according to the sorted data
    new_df.reset_index(inplace=True, drop=True)

    # Uniformize the options with some common string replacements
    new_df[options_header] = new_df[options_header].replace(
        {"nan": "NA", "Other(s):": "Other(s)"}, regex=False
    )

    return new_df
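To make the Counter-to-DataFrame step concrete, here is a minimal sketch of what prepare_dataframe does with hypothetical operating-system counts (the string replacements are omitted):

```python
import pandas as pd
from collections import Counter

# Hypothetical frequencies for a single question
counts = Counter({"Windows": 3, "Linux": 5, "MacOS": 4})

# The Counter keys become one column, its values the other
df = pd.DataFrame({
    "OperatingSystem": list(counts),
    "Respondents": [counts[o] for o in counts],
})

# Sort by frequency (descending, as in the default call)
df = df.sort_values("Respondents", ascending=False).reset_index(drop=True)
print(df["OperatingSystem"].tolist())  # ['Linux', 'MacOS', 'Windows']
```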
In [5]:
def unpivot_delimited_data(
    series: pd.Series,
    delimiter: str
    ) -> pd.Series:
    """
    Given a Series where each data value is a group of delimiter-separated
    options, unpivot the individual options into separate rows, keeping
    the ids for each corresponding individual value.
    """
    # The resulting Series has a multi-level index, where the first level\
    # represents a single reply and the second level represents the\
    # multiple options chosen for that reply
    return series\
        .apply( lambda x: pd.Series(x.split(delimiter)) )\
        .stack()
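On hypothetical data, the split-then-stack produces a Series whose first index level is the original row id and whose second level enumerates the options within that reply (the function is reproduced for a self-contained sketch):

```python
import pandas as pd

def unpivot_delimited_data(series: pd.Series, delimiter: str) -> pd.Series:
    # Split each value into its own row, keeping the original row id
    return series.apply(lambda x: pd.Series(x.split(delimiter))).stack()

# Hypothetical replies: respondent 0 chose two languages, respondent 1 chose one
replies = pd.Series(["Python;SQL", "Go"])
unpivoted = unpivot_delimited_data(replies, ";")
print(unpivoted.tolist())        # ['Python', 'SQL', 'Go']
print(unpivoted.index.tolist())  # [(0, 0), (0, 1), (1, 0)]
```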
In [6]:
def merge_unpivoted_data(
    unpivoted_df: pd.DataFrame,
    other_df: pd.DataFrame,
    other_df_key: str,
    first_col_name: str,
    second_col_name: str
    ) -> pd.DataFrame:
    """
    Given a DataFrame of unpivoted data and a DataFrame ready to be merged,
    merge the latter to the former.
    """
    # The current multi-level index is kept as columns before\
    # resetting the index. These columns are named level_0 and\
    # level_1, which means we now have a DataFrame with the old\
    # index and the unpivoted options
    unpivoted_df = unpivoted_df.reset_index()

    # Merge the DF of unpivoted data with the other DF using the\
    # shared indices
    # From the first DF, the key column is the one with the first\
    # level of indices. The key from the other DF is specified as\
    # an argument
    merged_df = pd.merge(
        unpivoted_df, 
        other_df, 
        left_on="level_0", 
        right_on=other_df_key
    )

    # Keep only the columns of interest after the merge (during\
    # the merge, the first column ends up being called zero)
    merged_df = merged_df[[0, second_col_name]]

    # And rename the first column
    merged_df.rename(
        columns={0: first_col_name},
        inplace=True
    )
    
    return merged_df
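A minimal sketch of the merge on hypothetical data (the RespId key and the compensation values are made up for illustration):

```python
import pandas as pd

# Hypothetical unpivoted languages, as produced by a split-and-stack
unpivoted = pd.Series(
    ["Python", "SQL", "Go"],
    index=pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0)]),
)
unpivoted_df = unpivoted.reset_index()  # columns: level_0, level_1, 0

# Hypothetical per-respondent data to merge in, keyed by row id
other_df = pd.DataFrame({"RespId": [0, 1], "ConvertedComp": [60000, 45000]})

merged = pd.merge(unpivoted_df, other_df, left_on="level_0", right_on="RespId")

# Keep the unpivoted values (column 0) and the merged column, then rename
merged = merged[[0, "ConvertedComp"]].rename(columns={0: "Language"})
print(merged["Language"].tolist())  # ['Python', 'SQL', 'Go']
```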
In [7]:
def sort_by_top(
    df: pd.DataFrame,
    target_column: str,
    top_n_list: List[str]
    ) -> pd.DataFrame:
    """
    Convert a DataFrame Series to the Categorical data type
    using a list of values as basis. When sorting the column,
    the order of the categories in the list is kept.
    """
    # Convert the column with unpivoted data to Categorical,\
    # using the list of top n as the available categories
    df[target_column] = pd.Categorical(
        df[target_column], 
        top_n_list
    )
    
    # When sorting this column of now Categorical values, the order\
    # of the options in the top n list will be kept
    df.sort_values(target_column, ascending=False, inplace=True)
    
    return df
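The effect of the Categorical conversion can be seen on hypothetical data: sorting respects the order of the category list rather than alphabetical order:

```python
import pandas as pd

# Hypothetical unpivoted languages and a "top 3" ordering
df = pd.DataFrame({"Language": ["SQL", "Python", "JavaScript", "SQL"]})
top_3 = ["JavaScript", "Python", "SQL"]

# Values become categories whose order follows the top-3 list
df["Language"] = pd.Categorical(df["Language"], top_3)
df = df.sort_values("Language", ascending=False)
print(df["Language"].tolist())  # ['SQL', 'SQL', 'Python', 'JavaScript']
```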
In [8]:
def get_mode_df(
    df: pd.DataFrame,
    groupby_column: str,
    mode_column: str
    ) -> pd.DataFrame:
    """
    Given a two-column DataFrame where the first column has the
    unpivoted options to a single question and the second column
    has replies of another question, find the mode of each of
    those replies for each unique option in the first column.
    These results are returned as a new two-column DataFrame.
    """
    # Group the DataFrame by the first column and find the\
    # most common value in the second column for each option\
    # in the former
    df_mode = df.groupby([groupby_column]).apply(
        lambda single_option_df: 
            single_option_df[mode_column].mode() 
    )

    # Reset the new DataFrame's index so that the data is\
    # properly put in distinct columns
    df_mode = df_mode.reset_index()
    
    # Rename the columns after the headers were lost in the grouping
    df_mode.rename(
        columns={0: mode_column},
        inplace=True
    )
    
    return df_mode
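On hypothetical data, grouping by the unpivoted option and taking the mode of the second column works like this (the column names are illustrative):

```python
import pandas as pd

# Hypothetical unpivoted languages paired with a work location reply
df = pd.DataFrame({
    "Language": ["Python", "Python", "Python", "SQL", "SQL"],
    "WorkLoc": ["Office", "Home", "Office", "Home", "Home"],
})

# Most common WorkLoc per language; the mode lands in a column named 0
df_mode = (
    df.groupby(["Language"])
    .apply(lambda g: g["WorkLoc"].mode())
    .reset_index()
    .rename(columns={0: "WorkLoc"})
)
print(df_mode["WorkLoc"].tolist())  # ['Office', 'Home']
```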

Plotting

In [9]:
def scatter_plot_preview(data_series: pd.Series) -> None:
    """
    Plot a Series of raw data to find its limits visually.
    """
    # Sort the data in ascending order
    data_series = data_series.sort_values(ascending=True)
    
    # Plot the Series in a scatter plot to explore it visually
    fig = px.scatter(
        x=data_series, 
        y=range(0, data_series.shape[0])
    )
    
    fig.show()
In [10]:
def plot_column_chart(
    data: pd.DataFrame, 
    x_data: str, 
    y_data: str, 
    data_labels: str, 
    title: str, 
    color: str,
    axes_titles: Optional[List[str]]=None
    ) -> None:
    """
    Given the necessary data, with the DataFrame columns for
    the axes and data labels specified, plot a column chart.
    """
    # Plot the chart
    fig = px.bar(
        data, x=x_data, y=y_data, 
        text=data_labels, 
        title=title
    )

    # Add the data labels inside the columns and change the color of the columns
    fig.update_traces(
        texttemplate="%{text:.2s}", 
        textposition="inside",
        marker_color=color
    )
    
    # If custom axes titles were passed, update the axes accordingly
    if axes_titles:
        fig.update_layout(
            xaxis_title=axes_titles[0], 
            yaxis_title=axes_titles[1]
        )
    
    # Adjust the font and center the title
    fig.update_layout(
        uniformtext_minsize=chart_font_size, 
        uniformtext_mode="hide",
        title_x=0.5
    )
    
    # Show the finalized plot
    fig.show()
In [11]:
def plot_bar_chart(
    data: pd.DataFrame, 
    x_data: str, 
    y_data: str, 
    data_labels: str, 
    title: str, 
    color: str,
    axes_titles: Optional[List[str]]=None
    ) -> None:
    """
    Given the necessary data, with the DataFrame columns for
    the axes and data labels specified, plot a bar chart.
    """
    # Plot the chart
    fig = px.bar(
        data, y=y_data, x=x_data, 
        text=data_labels, 
        title=title, 
        orientation="h"
    )
    
    # Add the data labels inside the columns and change the color of the columns
    fig.update_traces(
        texttemplate="%{text:.2s}", 
        textposition="inside", 
        marker_color=color
    )
    
    # If custom axes titles were passed, update the axes accordingly
    if axes_titles:
        fig.update_layout(
            xaxis_title=axes_titles[0], 
            yaxis_title=axes_titles[1]
        )
    
    # Adjust the font, use a log scale for the X axis and center the title 
    fig.update_layout(
        uniformtext_minsize=chart_font_size, 
        uniformtext_mode="hide", 
        xaxis_type="log", 
        title_x=0.5
    )
    
    # Show the finalized plot
    fig.show()
In [12]:
def plot_pie_chart(
    data: pd.DataFrame, 
    values: str, 
    names: str, 
    title: str, 
    color: List[str],
    donut_hole: float = 0.0,
    show_legend: bool = True
    ) -> None:
    """
    Plot a pie chart, which can be a donut chart with
    an optional legend.
    """
    # Plot the chart
    fig = px.pie(
        data, values=values, names=names,
        title=title, 
        hole=donut_hole, 
        color_discrete_sequence=color
    )
    
    # Show the data labels and the percentages, outside the wedges
    fig.update_traces(
        textinfo="label+percent", 
        textposition="outside"
    )
    
    # Adjust the font, hide/show the legend and center the title
    fig.update_layout(
        uniformtext_minsize=chart_font_size, 
        uniformtext_mode="hide", 
        showlegend=show_legend, 
        title_x=0.5
    )
    
    # Show the finalized plot
    fig.show()
In [13]:
def plot_histogram(
    x_series: pd.Series,
    n_bins: int,
    title: str,
    color: str,
    axes_titles: Optional[List[str]]=None
    ) -> None:
    """
    Plot a histogram of an already binned Series of
    data, using count as the aggregate function.
    """
    # Plot the chart
    fig = px.histogram(
        x=x_series,
        nbins=n_bins,
        title=title
    )
    
    # Change the color of the columns
    fig.update_traces(
        marker_color=color
    )
    
    # If custom axes titles were passed, update the axes accordingly
    if axes_titles:
        fig.update_layout(
            xaxis_title=axes_titles[0], 
            yaxis_title=axes_titles[1]
        )
        
    # Adjust the font, remove horizontal gaps between the bins and center the title
    fig.update_layout(
        uniformtext_minsize=chart_font_size, 
        uniformtext_mode="hide", 
        bargap=0, 
        title_x=0.5
    )
    
    # Show the finalized plot
    fig.show()
In [14]:
def plot_mult_histogram(
    df: pd.DataFrame,
    x_column: str,
    categories_col: str,
    n_bins: int,
    title: str,
    axes_titles: Optional[List[str]]=None
    ) -> None:
    """
    Plot multiple histograms for the same column (breaking
    down each bin by category).
    """
    # Given a single dataframe, plot a histogram, specifying one\
    # column for the X axis and another to represent the categories\
    # of each bin
    fig = px.histogram(
        df, 
        x=x_column, 
        color=categories_col,
        nbins=n_bins,
        title=title
    )
    
    # If custom axes titles were passed, update the axes accordingly
    if axes_titles:
        fig.update_layout(
            xaxis_title=axes_titles[0], 
            yaxis_title=axes_titles[1]
        )
        
    # Adjust the font, remove horizontal gaps between the bins and center the title
    fig.update_layout(
        uniformtext_minsize=chart_font_size, 
        uniformtext_mode="hide", 
        bargap=0, 
        title_x=0.5
    )
    
    # Show the finalized plot
    fig.show()
In [15]:
def plot_table(
    df: pd.DataFrame,
    column_headers: List[str],
    data_columns: List[str],
    title: str
    ) -> None:
    """
    Given a DataFrame, plot a two-column table using its data.
    """
    # Plot the table using the custom column headers received\
    # and the Series of the DataFrame as the data to be displayed
    fig = go.Figure(
        data=[
            go.Table(
                header=dict(
                    values=column_headers,
                    fill_color="paleturquoise", 
                    align="left"
                ),
                cells=dict(
                    values=[
                        df[ data_columns[0] ], 
                        df[ data_columns[1] ]
                    ],
                    fill_color="lavender",
                    align="left"
                )
            )
        ])

    # Add a horizontally-centered title
    fig.update_layout(
        title_text=title,
        title_x=0.5
    )

    # Show the plot
    fig.show()

Data Loading

In [16]:
# Load the dataset
survey_results = pd.read_csv("../data/survey_results_public.csv")

# Number of respondents
num_respondents = survey_results.shape[0]

# Fill missing numerical series' data with zeroes for the time being
survey_results["Age"] = survey_results["Age"].fillna(0)
survey_results["WorkWeekHrs"] = survey_results["WorkWeekHrs"].fillna(0)
survey_results["ConvertedComp"] = survey_results["ConvertedComp"].fillna(0)

# General data
ages = survey_results["Age"]
genders = survey_results["Gender"].astype("str")
ed_levels = survey_results["EdLevel"].astype("str")
employment_statuses = survey_results["Employment"].astype("str")
countries = survey_results["Country"].astype("str")

# Tech Stack data
languages = survey_results["LanguageWorkedWith"].astype("str")
platforms = survey_results["PlatformWorkedWith"].astype("str")
databases = survey_results["DatabaseWorkedWith"].astype("str")
web_frameworks = survey_results["WebFrameWorkedWith"].astype("str")
dev_envs = survey_results["DevEnviron"].astype("str")
op_sys = survey_results["OpSys"].astype("str")

# Professional Life data
underg_majors = survey_results["UndergradMajor"].astype("str")
prog_usages = survey_results["MainBranch"].astype("str")
work_week_hrs = survey_results["WorkWeekHrs"]
work_locs = survey_results["WorkLoc"].astype("str")
job_factors = survey_results["JobFactors"].astype("str")
resume_updates = survey_results["ResumeUpdate"].astype("str")
annual_compensations = survey_results["ConvertedComp"]

General Data

This first section explores general data about the respondents, including their age, gender, country, academic background, and more. As such, it helps us get to know the people who completed the survey.

Age

In [17]:
# Explore the raw data visually to find its limits
# scatter_plot_preview(ages) # 9-90 years

# Mark the outliers as missing data (younger than\
# 9 years and older than 90)
ages = ages.where(ages >= 9, np.NaN)
ages = ages.where(ages <= 90, np.NaN)

# Create the bin labels
ages_labels = pd.Series([f"[{i}, {i+10})" for i in range(0, 81, 10)])

# Create the bin intervals (closed on the left side)
ages_bins = pd.IntervalIndex.from_tuples(
    [(i, i+10) for i in range(0, 81, 10)],
    closed="left"
)

# Bin the respondent ages into nine bins
ages = pd.cut(
    ages, ages_bins, 
    right=False, 
    labels=ages_labels, 
    precision=0, 
    include_lowest=True
)

# Drop missing values
ages = ages.dropna()

# Sort the binned data in ascending order
ages.sort_values(ascending=True, inplace=True)

# Change the values from categorical to string to be able to plot them
ages = ages.astype("str")

# Plot a histogram for the binned ages
plot_histogram(
    ages,
    ages_bins.shape[0], 
    "Age Distribution of Respondents", 
    general_color,
    axes_titles=[None, None]
)
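The binning above can be sketched on a handful of hypothetical ages: values outside the accepted range are marked as missing and dropped, and each remaining age lands in a left-closed decade interval:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one outlier below the accepted 9-90 range
ages = pd.Series([5.0, 23.0, 23.0, 47.0])
ages = ages.where(ages >= 9, np.nan)

# Left-closed decade bins: [0, 10), [10, 20), ..., [80, 90)
bins = pd.IntervalIndex.from_tuples(
    [(i, i + 10) for i in range(0, 81, 10)],
    closed="left",
)

# Bin, drop the outlier-turned-NaN, and stringify for plotting
binned = pd.cut(ages, bins).dropna().astype("str")
print(binned.tolist())  # ['[20, 30)', '[20, 30)', '[40, 50)']
```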

Gender

In [18]:
# Group the options stored for gender using different labels
genders = genders.map(
    lambda response: 
        "NA" if (response == "nan")
        else "Man" if (response == "Man")
        else "Woman" if (response == "Woman")
        else "Other"
)

# Count the frequencies of each gender option
genders_counts = count_options_frequencies(genders)

# Create a DataFrame for the genders and their frequencies
genders_df = prepare_dataframe(
    genders_counts, 
    ["Gender", "Respondents"]
)

# Filter out blanks for the plot
genders_plot = genders_df[genders_df["Gender"] != "NA"]

# Plot a pie chart with the gender distribution
plot_pie_chart(
    genders_plot, 
    "Respondents", 
    "Gender", 
    "Gender Distribution of Respondents",
    general_color_gradient,
    show_legend=False,
    donut_hole=0.6
)

Education Level

In [19]:
# Reword the options stored for education level
ed_levels = ed_levels.map(
    lambda response:
        "NA" if (response == "nan")
        else "No formal education" if (response == "I never completed any formal education")
        else "Primary School" if (response == "Primary/elementary school")
        else "Secondary School" if (response == "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)")
        else "Higher ed. study w/o Degree" if (response == "Some college/university study without earning a degree")
        else "Associate Degree" if (response == "Associate degree")
        else "Bachelor’s Degree" if (response == "Bachelor’s degree (BA, BS, B.Eng., etc.)")
        else "Master’s Degree" if (response == "Master’s degree (MA, MS, M.Eng., MBA, etc.)")
        else "Professional Degree" if (response == "Professional degree (JD, MD, etc.)")
        else "Doctoral Degree"
)

# Count the frequencies of each education level option
ed_levels_counts = count_options_frequencies(ed_levels)

# Create a DataFrame for the education level options and\
# their frequencies
ed_levels_df = prepare_dataframe(
    ed_levels_counts, 
    ["EducationLevel", "Respondents"], 
    ascend_sort=True
)

# Filter out blanks for the plot
ed_levels_plot = ed_levels_df[ed_levels_df["EducationLevel"] != "NA"]

# Plot a bar chart with education level options frequencies
plot_bar_chart(
    ed_levels_plot,
    "Respondents", "EducationLevel",
    "Respondents",
    "Education Level of Respondents",
    general_color,
    axes_titles=["Respondents", None]
)

Undergrad Major

In [20]:
# Reword the options stored for undergrad majors
underg_majors = underg_majors.map(
    lambda response:
        "NA" if (response == "nan")
        else "CompSci/SoftEng" if (response == "Computer science, computer engineering, or software engineering")
        else "Web Development/Design" if (response == "Web development or web design")
        else "InfoTech/SysAdmin" if (response == "Information systems, information technology, or system administration")
        else "Mathematics/Statistics" if (response == "Mathematics or statistics")
        else "Another Eng. Discipline" if (response == "Another engineering discipline (ex. civil, electrical, mechanical)")
        else "Business" if (response == "A business discipline (ex. accounting, finance, marketing)")
        else "Health Science" if (response == "A health science (ex. nursing, pharmacy, radiology)")
        else "Humanities" if (response == "A humanities discipline (ex. literature, history, philosophy)")
        else "Natural Science" if (response == "A natural science (ex. biology, chemistry, physics)")
        else "Social Science" if (response == "A social science (ex. anthropology, psychology, political science)")
        else "Fine Arts/Performing Arts" if (response == "Fine arts or performing arts (ex. graphic design, music, studio art)")
        else "None" if (response == "I never declared a major")
        else "Other"
)

# Count the frequencies of each undergrad major option
underg_majors_counts = count_options_frequencies(underg_majors)

# Create a DataFrame for the undergrad majors and their frequencies
underg_majors_df = prepare_dataframe(
    underg_majors_counts, 
    ["UndergradMajor", "Respondents"],
    ascend_sort=True
)

# Filter out blanks for the plot
underg_majors_plot = underg_majors_df[
    underg_majors_df["UndergradMajor"] != "NA"
]

# Plot a bar chart for the undergrad majors
plot_bar_chart(
    underg_majors_plot,
    "Respondents", "UndergradMajor",
    "Respondents",
    "Most Common Undergrad Majors",
    general_color,
    axes_titles=["Respondents", None]
)

Employment Status

In [21]:
# Reword the options stored for employment status
employment_statuses = employment_statuses.map(
    lambda response:
        "NA" if (response == "nan")
        else "Not looking for work" if (response == "Not employed, and not looking for work")
        else "Looking for work" if (response == "Not employed, but looking for work")
        else "Full-time" if (response == "Employed full-time")
        else "Part-time" if (response == "Employed part-time")
        else "Freelancer/Self-employed" if (response == "Independent contractor, freelancer, or self-employed")
        else "Retired" if (response == "Retired")
        else "Other"
)

# Count the frequencies of each employment status option
employment_statuses_counts = count_options_frequencies(employment_statuses)

# Create a DataFrame for the employment statuses and their frequencies
employment_statuses_df = prepare_dataframe(
    employment_statuses_counts, 
    ["EmploymentStatus", "Respondents"], 
    ascend_sort=True
)

# Filter out blanks for the plot
employment_statuses_plot = employment_statuses_df[
    employment_statuses_df["EmploymentStatus"] != "NA"
]

# Plot a bar chart with the employment statuses
plot_bar_chart(
    employment_statuses_plot,
    "Respondents", "EmploymentStatus",
    "Respondents",
    "Employment Status of Respondents",
    general_color,
    axes_titles=["Respondents", None]
)

Country

In [22]:
# Count the frequencies of each country
countries_counts = count_options_frequencies(countries)

# Create a DataFrame for the countries and their frequencies
countries_df = prepare_dataframe(
    countries_counts, 
    ["Country", "Respondents"], 
    ascend_sort=True
)

# Filter out blanks for the plot
countries_plot = countries_df[countries_df["Country"] != "NA"]

# Plot a bar chart with the top 10 countries
plot_bar_chart(
    countries_plot.tail(n=10),
    "Respondents", "Country",
    "Respondents",
    "Top 10 Countries with Most Respondents",
    general_color,
    axes_titles=["Respondents", None]
)

Key Takeaways

  • Over 40% of the respondents were between the ages of 20 and 30, and almost 70% were between 20 and 40 years old;
  • Over 90% of the respondents were men and less than 8% were women, with the remaining 2% identifying as another gender;
  • About 43% of the respondents completed a Bachelor's Degree, but only half of those went on to complete a Master's Degree;
  • More than half of the respondents come from a Computer Science or Software Engineering academic background. The rest come primarily from other Engineering disciplines, Information Technologies, System Administration, or Web Design and/or Development;
  • 70% of the respondents have a full-time job, with freelance and/or self-employment being the second most common employment status;
  • The five countries with the most respondents are, from most to least, the United States of America, India, Germany, the United Kingdom, and Canada.

Tech Stack

The "Tech Stack" data comprises questions regarding which technologies and technical skills the respondents are proficient with.

In other words, the Tech Stack section is about answering questions such as "Which are the most used programming languages?", "Which frameworks are used for web development?" or even "Which are the respondents' favorite development environments/IDEs?".

Programming Languages

In [23]:
# Extract the individual languages from each\
# reply to count their frequencies
languages_counts = count_options_frequencies(languages)

# Create a DataFrame for the languages and their frequencies
languages_df = prepare_dataframe(
    languages_counts, 
    ["Language", "Respondents"]
)

# Filter out blanks for the plot
languages_plot = languages_df[languages_df["Language"] != "NA"]

# Plot a column chart for the top 10 languages
plot_column_chart(
    languages_plot.head(n=10),
    "Language", "Respondents",
    "Respondents",
    "Top 10 Programming Languages",
    tech_stack_color,
    axes_titles=[None, None]
)

Database Environments

In [24]:
# Extract the individual databases from each\
# reply to count their frequencies
databases_counts = count_options_frequencies(databases)

# Create a DataFrame for the databases and their frequencies
databases_df = prepare_dataframe(
    databases_counts, 
    ["Database", "Respondents"]
)

# Filter out blanks for the plot
databases_plot = databases_df[databases_df["Database"] != "NA"]

# Plot a column chart for the top 10 databases
plot_column_chart(
    databases_plot.head(n=10),
    "Database", "Respondents",
    "Respondents",
    "Top 10 Database Environments",
    tech_stack_color,
    axes_titles=[None, None]
)

Web Frameworks

In [25]:
# Extract the individual web frameworks from each\
# reply to count their frequencies
web_frameworks_counts = count_options_frequencies(web_frameworks)

# Create a DataFrame for the web frameworks and their frequencies
web_frameworks_df = prepare_dataframe(
    web_frameworks_counts, 
    ["WebFramework", "Respondents"]
)

# Filter out blanks for the plot
web_frameworks_plot = web_frameworks_df[
    web_frameworks_df["WebFramework"] != "NA"
]

# Plot a column chart for the top 10 web frameworks
plot_column_chart(
    web_frameworks_plot.head(n=10),
    "WebFramework", "Respondents",
    "Respondents",
    "Top 10 Web Frameworks",
    tech_stack_color,
    axes_titles=[None, None]
)

Development Environments

In [26]:
# Extract the individual development environments from each\
# reply to count their frequencies
dev_envs_counts = count_options_frequencies(dev_envs)

# Create a DataFrame for the development environments and\
# their frequencies
dev_envs_df = prepare_dataframe(
    dev_envs_counts, 
    ["DevEnv", "Respondents"]
)

# Filter out blanks for the plot
dev_envs_plot = dev_envs_df[dev_envs_df["DevEnv"] != "NA"]

# Plot a column chart for the top 5 development environments
plot_column_chart(
    dev_envs_plot.head(n=5),
    "DevEnv", "Respondents",
    "Respondents",
    "Top 5 Favorite IDEs/Development Environments",
    tech_stack_color,
    axes_titles=[None, None]
)

Operating Systems

In [27]:
# Count the frequencies of each operating system
op_sys_counts = count_options_frequencies(op_sys)

# Create a dataframe for the frequencies
op_sys_df = prepare_dataframe(
    op_sys_counts, 
    ["OperatingSystem", "Respondents"]
)

# Filter out blanks for the plot
op_sys_plot = op_sys_df[op_sys_df["OperatingSystem"] != "NA"]

# Plot a pie chart for the operating system distribution
plot_pie_chart(
    op_sys_plot, 
    "Respondents", 
    "OperatingSystem", 
    "Operating System Usage",
    tech_stack_color_gradient,
    show_legend=False,
    donut_hole=0.6
)

Key Takeaways

  • Over 60% of the respondents use JavaScript and/or HTML/CSS, over 50% use SQL and 40% use Python and/or Java;
  • Over 46% of the respondents use MySQL for database environments. On the other hand, only about 23% use PostgreSQL and/or Microsoft SQL Server, the second and third most popular environments, respectively;
  • jQuery is still the most used Web Framework, used by almost a third of respondents. React.js comes in second place (22%) and Angular(.js) in a very close third;
  • In a group of 10 developers, about 5 of them reported using Visual Studio Code as one of their IDEs/Development Environments. About 3 people in the group use Visual Studio and/or Notepad++ as well;
  • Windows is the operating system of choice of almost half of the respondents. The remaining half is fairly evenly divided between MacOS and Linux-based systems, but the former takes second place.

Professional Life

This third section looks at the professional lives of the respondents: their work location (remote vs office), their most important job factors, the role programming plays in their lives, annual compensation, and more.

Professional Usage of Programming

In [28]:
# Reword the options stored for programming usages
prog_usages = prog_usages.map(
    lambda response:
        "NA" if (response == "nan")
        else "Student" if (response == "I am a student who is learning to code")
        else "Code as part of job" if (response == "I am not primarily a developer, but I write code sometimes as part of my work")
        else "Hobbyist" if (response == "I code primarily as a hobby")
        else "Professional developer" if (response == "I am a developer by profession")
        else "Used to work as developer" if (response == "I used to be a developer by profession, but no longer am")
        else "Other"
)

# Count the frequencies of each programming usage option
prog_usages_counts = count_options_frequencies(prog_usages)

# Create a DataFrame for the programming usages and their frequencies
prog_usages_df = prepare_dataframe(
    prog_usages_counts, 
    ["ProgUsages", "Respondents"]
)

# Filter out blanks for the plot
prog_usages_plot = prog_usages_df[prog_usages_df["ProgUsages"] != "NA"]

# Plot a column chart for the programming usages
plot_column_chart(
    prog_usages_plot,
    "ProgUsages", "Respondents",
    "Respondents",
    "Professional Usage of Programming",
    people_color,
    axes_titles=[None, None]
)
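
The chained `if`/`else` lambda above works, but the same rewording is arguably clearer as a dictionary lookup with a fallback. A minimal sketch (the `rewordings` dict and sample replies are illustrative, using two of the survey's actual answer strings):

```python
import pandas as pd

# Map known survey strings to short labels; anything else becomes "Other",
# and the literal string "nan" becomes "NA", mirroring the lambda above
rewordings = {
    "nan": "NA",
    "I am a student who is learning to code": "Student",
    "I am a developer by profession": "Professional developer",
}

responses = pd.Series([
    "I am a developer by profession",
    "I am a student who is learning to code",
    "Some unexpected answer",
])

reworded = responses.map(lambda r: rewordings.get(r, "Other"))
print(reworded.tolist())  # ['Professional developer', 'Student', 'Other']
```

Adding a new survey option then only requires a new dictionary entry rather than another `else ... if` branch.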

Work Week Hours

In [29]:
# Explore the raw data visually to find its limits
# scatter_plot_preview(work_week_hrs) # 8-90 hours

# Mark the outliers as missing data (less than 8 hours\
# and more than 90)
work_week_hrs = work_week_hrs.where(work_week_hrs >= 8, np.nan)
work_week_hrs = work_week_hrs.where(work_week_hrs <= 90, np.nan)

# Create the bin labels
work_week_hrs_labels = pd.Series([f"[{i}, {i+10})" for i in range(0, 81, 10)])

# Create the bin intervals (closed on the left)
work_week_hrs_bins = pd.IntervalIndex.from_tuples(
    [(i, i+10) for i in range(0, 81, 10)], 
    closed="left"
)

# Bin the work week hours Series into nine bins
work_week_hrs = pd.cut(
    work_week_hrs, work_week_hrs_bins, 
    right=False, 
    labels=work_week_hrs_labels, 
    precision=0, 
    include_lowest=True
)

# Drop missing values
work_week_hrs = work_week_hrs.dropna()

# Sort the binned data in ascending order
work_week_hrs.sort_values(ascending=True, inplace=True)

# Change the values from categorical to string to be able to plot them
work_week_hrs = work_week_hrs.astype("str")

# Plot a histogram for the binned work week hours
plot_histogram(
    work_week_hrs,
    work_week_hrs_bins.shape[0], 
    f"Work Week Length (Hours)", 
    people_color,
    axes_titles=[None, None]
)
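
The outlier-masking and binning steps above can be sketched in isolation. A minimal example with made-up hour values; note the bins are left-closed, so 45 falls into [40, 50):

```python
import numpy as np
import pandas as pd

# Hypothetical weekly hours, including one out-of-range value
hours = pd.Series([38.0, 45.0, 60.0, 120.0])

# Mask values outside [8, 90] as missing, as in the notebook
hours = hours.where((hours >= 8) & (hours <= 90), np.nan)

# Left-closed decade bins [0, 10), [10, 20), ..., [80, 90)
bins = pd.IntervalIndex.from_tuples(
    [(i, i + 10) for i in range(0, 81, 10)],
    closed="left"
)

binned = pd.cut(hours, bins).dropna().astype("str")
print(binned.tolist())  # ['[30, 40)', '[40, 50)', '[60, 70)']
```

The 120-hour reply becomes `NaN` and is dropped, so only the three in-range values survive the cut.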

Work Location

In [30]:
# Reword the options stored for work locations
work_locs = work_locs.map(
    lambda response:
        "NA" if (response == "nan")
        else "Office" if (response == "Office")
        else "Remote" if (response == "Home")
        else "Remote" if (response == "Other place, such as a coworking space or cafe")
        else "Other"
)

# Count the frequencies of each work location option
work_locs_counts = count_options_frequencies(work_locs)

# Create a DataFrame for the work locations and their frequencies
work_locs_df = prepare_dataframe(
    work_locs_counts, 
    ["WorkLocation", "Respondents"]
)

# Filter out blanks for the plot
work_locs_plot = work_locs_df[work_locs_df["WorkLocation"] != "NA"]

# Plot a pie chart for the work locations
plot_pie_chart(
    work_locs_plot, 
    "Respondents", 
    "WorkLocation", 
    "Work Location - Office vs Remote ",
    people_color_gradient,
    show_legend=False,
    donut_hole=0.6
)

Most Important Job Factors

In [31]:
# Count the frequencies of each job factor
job_factors_counts = count_options_frequencies(job_factors)

# Create a DataFrame for the job factors and their frequencies
job_factors_df = prepare_dataframe(job_factors_counts, ["JobFactor", "Respondents"])

# Reword the options stored for job factors
job_factors_df["JobFactor"] = job_factors_df["JobFactor"].map(
    lambda response:
        "NA" if (response == "nan")
        else "Technical compatibility" if (response == "Languages, frameworks, and other technologies I'd be working with")
        else "Culture compatibility" if (response == "Office environment or company culture")
        else "Flexible schedule" if (response == "Flex time or a flexible schedule")
        else "Professional development" if (response == "Opportunities for professional development")
        else "Remote work options" if (response == "Remote work options")
        else "Impact of my work" if (response == "How widely used or impactful my work output would be")
        else "Industry compatibility" if (response == "Industry that I'd be working in")
        else "Company's financial performance" if (response == "Financial performance or funding status of the company or organization")
        else "Department/Team compatibility" if (response == "Specific department or team I'd be working on")
        else "Diversity of the company" if (response == "Diversity of the company or organization")
        else "Other"
)

# Filter out blanks for the plot
job_factors_plot = job_factors_df[job_factors_df["JobFactor"] != "NA"]

# Plot a column chart for the top 5 job factors
plot_column_chart(
    job_factors_plot.head(n=5),
    "JobFactor", "Respondents",
    "Respondents",
    "Five Most Important Job Factors",
    people_color,
    axes_titles=[None, None]
)

Reason for Latest Resume Update

In [32]:
# Reword the options stored for resume update reasons
resume_updates = resume_updates.map(
    lambda response:
        "NA" if (response == "nan")
        else "New achievement" if (response == "Something else changed (education, award, media, etc.)")
        else "Job search" if (response == "I was preparing for a job search")
        else "Job opportunity" if (response == "I heard about a job opportunity (from a recruiter, online job posting, etc.)")
        else "Job status change" if (response == "My job status changed (promotion, new job, etc.)")
        else "Negative experience at work" if (response == "I had a negative experience or interaction at work")
        else "Re-entry into workforce" if (response == "Re-entry into the workforce")
        else "Other"
)

# Count the frequencies of each resume update reason
resume_updates_counts = count_options_frequencies(resume_updates)

# Create a DataFrame for the resume update reasons and their frequencies
resume_updates_df = prepare_dataframe(
    resume_updates_counts, 
    ["UpdateReason", "Respondents"]
)

# Filter out blanks for the plot
resume_updates_plot = resume_updates_df[
    resume_updates_df["UpdateReason"] != "NA"
]

# Plot a column chart for the resume update reasons
plot_column_chart(
    resume_updates_plot,
    "UpdateReason", "Respondents",
    "Respondents",
    "Reason for Latest Resume Update",
    people_color,
    axes_titles=[None, None]
)

Annual Compensation (USD)

In [33]:
# Explore the raw data visually to define its limits
# scatter_plot_preview(annual_compensations) # 0-150k USD

# Mark the outliers as missing data (less than 0 USD\
# and more than 150k USD)
annual_compensations = annual_compensations.where(
    annual_compensations >= 0, 
    np.nan
)
annual_compensations = annual_compensations.where(
    annual_compensations <= 150000, 
    np.nan
)

# Create the bin labels
annual_compensations_labels = pd.Series(
    [f"[{i:,}, {i+25000:,})" for i in range(0, 125001, 25000)]
)

# Create the bin intervals
annual_compensations_bins = pd.IntervalIndex.from_tuples(
    [(i, i+25000) for i in range(0, 125001, 25000)],
    closed="left"
)

# Bin the annual compensation Series into six bins
annual_compensations = pd.cut(
    annual_compensations, annual_compensations_bins, 
    right=False, 
    labels=annual_compensations_labels, 
    precision=0, 
    include_lowest=True
)

# Drop missing values
annual_compensations = annual_compensations.dropna()

# Sort the binned data in ascending order
annual_compensations.sort_values(ascending=True, inplace=True)

# Change the values from categorical to string to be able to plot them
annual_compensations = annual_compensations.astype("str")

# Plot a histogram for the binned annual compensation
plot_histogram(
    annual_compensations,
    annual_compensations_bins.shape[0], 
    "Annual Compensation (USD)", 
    people_color,
    axes_titles=[None, None]
)

Key Takeaways

  • Almost three quarters of the respondents are developers by profession. Over 10% are students and 8% write code as part of their job;
  • 44% of the respondents work between 40 and 50 hours per week. 11% work between 30 and 40 hours and 6% between 50 and 60 hours;
  • In a group of 10 people, close to 6 work at an office and 4 work remotely;
  • Technical compatibility is a deal-breaker for almost half of the respondents when looking for a job. Culture compatibility, a flexible schedule and opportunities for professional development are each almost as important, cited by about 40% of the respondents;
  • 37% of the respondents made their most recent resume update to search for a job. A change in job status came second, reported by 15%;
  • Half of the respondents reported an annual compensation of $25,000 or less. In comparison, 13% reported between $25,000 and $50,000 and 11% between $50,000 and $75,000. The remaining quarter earn between $75,000 and $150,000.

Other

After getting to know the respondents through general information, the technologies they use and some aspects of their professional lives, this last section combines data from the previous sections to answer the final two questions of the analysis:

  • How does remote work factor into today's professional life (compensation and work week hours)?
  • How does the tech stack relate to annual compensation?

Remote Work in Professional Life

In [34]:
# Remove blanks from the original work location series
work_locs = work_locs[work_locs != "NA"]

# Create a DataFrame using the annual compensations Series
compens_workloc_df = pd.DataFrame(annual_compensations)

# Join the original work location Series to the DataFrame, using\
# the Series' index. 
# The inner join assures the DataFrame will have only responses\
# where both the annual compensation and the work location were\
# reported
compens_workloc_df = compens_workloc_df.join(work_locs, how="inner")

# Plot a histogram for the annual compensation, broken down by\
# work location
plot_mult_histogram(
    compens_workloc_df,
    "ConvertedComp",
    "WorkLoc",
    annual_compensations_bins.shape[0],
    f"Annual Compensation vs Work Location",
    axes_titles=["Annual Compensation (USD)", None]
)
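
The index-based inner join used above can be shown with two toy Series: only indices present in both survive, which is how replies missing either answer are dropped (the respondent indices and values here are made up):

```python
import pandas as pd

# Toy data: respondent 1 only reported compensation,
# respondent 3 only reported work location
comp = pd.Series(
    ["[0, 25,000)", "[25,000, 50,000)"], index=[1, 2], name="ConvertedComp"
)
loc = pd.Series(["Office", "Remote"], index=[2, 3], name="WorkLoc")

# Inner join on the index keeps only respondent 2,
# who answered both questions
joined = pd.DataFrame(comp).join(loc, how="inner")
print(joined)
```
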
In [35]:
# Create a DataFrame using the work week hours series
workweekhrs_workloc_df = pd.DataFrame(work_week_hrs)

# Join the original work location Series to the DataFrame, using\
# the Series' index. 
# The inner join assures the DataFrame will have only responses\
# where both the work week hours and the work location were\
# reported
workweekhrs_workloc_df = workweekhrs_workloc_df.join(
    work_locs, 
    how="inner"
)

# Plot a histogram of the work week length broken down by work\
# location
plot_mult_histogram(
    workweekhrs_workloc_df,
    "WorkWeekHrs",
    "WorkLoc",
    work_week_hrs_bins.shape[0],
    f"Work Week Length (Hours) vs Work Location",
    axes_titles=["Work Week Hours", None]
)

Programming Languages Compensation

In [36]:
# Remove blanks from the original programming language replies
languages = languages[languages != "nan"]

# Since this question was a multiple choice question, each\
# reply can have more than one option, separated by semicolons.\
# Thus, the individual options need to be separated into different\
# rows, resulting in a multi-level index: the first level represents\
# each reply and the second level each option included in that reply
languages_unpivot = unpivot_delimited_data(languages, ";")

# Reset the compensations' index so that we have a column\
# with the index to select during the merge 
annual_compensations_reset = annual_compensations.reset_index()

# Merge the unpivoted data with the annual compensations, leaving\
# out replies where one of the questions was not answered
languages_compens_df = merge_unpivoted_data(
    languages_unpivot, 
    annual_compensations_reset, 
    "index", 
    "Language", 
    "ConvertedComp"
)

# Get the top 10 programming languages, in descending order
top_10_langs = languages_plot.head(n=10)["Language"].unique()

# Remove the rows with languages not in the top 10
languages_compens_df = languages_compens_df[
    languages_compens_df["Language"].isin(top_10_langs)
]

# Convert the column of unpivoted data into the Categorical\
# data type, using the top n list as the available categories.\
# This makes it so that when the data is sorted it keeps the\
# order of the top n list
languages_compens_df = sort_by_top(
    languages_compens_df, 
    "Language", 
    top_10_langs
)

# Group the DataFrame by programming language and find the\
# most common compensation interval for each
languages_compens_mode = get_mode_df(
    languages_compens_df, 
    "Language", 
    "ConvertedComp"
)

# Show a table with the most common compensation interval for\
# each programming language
plot_table(
    languages_compens_mode, 
    ["Programming Language", "Compensation Interval"], 
    ["Language", "ConvertedComp"],
    "Most common Annual Compensation level (USD)<br>for the top 10 Programming Languages"
)
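
One plausible implementation of the `unpivot_delimited_data` helper (defined earlier in the notebook, not shown in this section) is `str.split` followed by `explode`, which keeps the original reply index for the later merge:

```python
import pandas as pd

# Each reply may list several languages separated by semicolons
langs = pd.Series(["Python;SQL", "JavaScript"], name="Language")

# Split on the delimiter, then give each option its own row;
# the repeated index values identify the originating reply
unpivoted = langs.str.split(";").explode()
print(unpivoted.tolist())        # ['Python', 'SQL', 'JavaScript']
print(unpivoted.index.tolist())  # [0, 0, 1]
```
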

Database Environments Compensation

In [37]:
# Remove blanks from the original database replies
databases = databases[databases != "nan"]

# Since this question was a multiple choice question, each\
# reply can have more than one option, separated by semicolons.\
# Thus, the individual options need to be separated into different\
# rows, resulting in a multi-level index: the first level represents\
# each reply and the second level each option included in that reply
databases_unpivot = unpivot_delimited_data(databases, ";")

# Merge the unpivoted data with the annual compensations, leaving\
# out replies where one of the questions was not answered
databases_compens_df = merge_unpivoted_data(
    databases_unpivot, 
    annual_compensations_reset, 
    "index", 
    "Database", 
    "ConvertedComp"
)

# Get the top 10 databases, in descending order
top_10_dbs = databases_plot.head(n=10)["Database"].unique()

# Remove the rows with databases not in the top 10
databases_compens_df = databases_compens_df[
    databases_compens_df["Database"].isin(top_10_dbs)
]

# Convert the column of unpivoted data into the Categorical\
# data type, using the top n list as the available categories.\
# This makes it so that when the data is sorted it keeps the\
# order of the top n list
databases_compens_df = sort_by_top(
    databases_compens_df, 
    "Database", 
    top_10_dbs
)

# Group the DataFrame by database and find the most common\
# compensation interval for each
databases_compens_mode = get_mode_df(
    databases_compens_df, 
    "Database", 
    "ConvertedComp"
)

# Show a table with the most common compensation interval for\
# each database environment
plot_table(
    databases_compens_mode, 
    ["Database Environment", "Compensation Interval"], 
    ["Database", "ConvertedComp"],
    "Most common Annual Compensation level (USD)<br>for the top 10 Database Environments"
)

Web Development Compensation

In [38]:
# Remove blanks from the original web framework replies
web_frameworks = web_frameworks[web_frameworks != "nan"]

# Since this question was a multiple choice question, each\
# reply can have more than one option, separated by semicolons.\
# Thus, the individual options need to be separated into different\
# rows, resulting in a multi-level index: the first level represents\
# each reply and the second level each option included in that reply
web_frameworks_unpivot = unpivot_delimited_data(web_frameworks, ";")

# Merge the unpivoted data with the annual compensations, leaving\
# out replies where one of the questions was not answered
web_frameworks_compens_df = merge_unpivoted_data(
    web_frameworks_unpivot, 
    annual_compensations_reset, 
    "index", 
    "WebFramework", 
    "ConvertedComp"
)

# Get the top 10 web frameworks, in descending order
top_10_webfws = web_frameworks_plot.head(n=10)["WebFramework"].unique()

# Remove the rows with web frameworks not in the top 10
web_frameworks_compens_df = web_frameworks_compens_df[
    web_frameworks_compens_df["WebFramework"].isin(top_10_webfws)
]

# Convert the column of unpivoted data into the Categorical\
# data type, using the top n list as the available categories.\
# This makes it so that when the data is sorted it keeps the\
# order of the top n list
web_frameworks_compens_df = sort_by_top(
    web_frameworks_compens_df, 
    "WebFramework", 
    top_10_webfws
)

# Group the DataFrame by web frameworks and find the most common\
# compensation interval for each
web_frameworks_compens_mode = get_mode_df(
    web_frameworks_compens_df, 
    "WebFramework", 
    "ConvertedComp"
)

# Show a table with the most common compensation interval for\
# each web framework
plot_table(
    web_frameworks_compens_mode, 
    ["Web Framework", "Compensation Interval"], 
    ["WebFramework", "ConvertedComp"],
    "Most common Annual Compensation level (USD)<br>for the top 10 Web Development Frameworks"
)
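
The `sort_by_top` / `get_mode_df` pair used in these three cells can be sketched with an ordered `Categorical` and a per-group mode. This is a hedged sketch, not the notebook's actual helpers; the toy data and column names mirror the web framework cell:

```python
import pandas as pd

df = pd.DataFrame({
    "WebFramework": ["React.js", "jQuery", "jQuery", "React.js", "jQuery"],
    "ConvertedComp": ["[0, 25,000)", "[0, 25,000)", "[25,000, 50,000)",
                      "[0, 25,000)", "[0, 25,000)"],
})

top = ["jQuery", "React.js"]  # desired display order

# Ordered categories make sort_values (and groupby) respect the top-n order
df["WebFramework"] = pd.Categorical(df["WebFramework"], categories=top, ordered=True)
df = df.sort_values("WebFramework")

# Most common compensation interval per framework
mode = (
    df.groupby("WebFramework", observed=True)["ConvertedComp"]
      .agg(lambda s: s.mode().iloc[0])
      .reset_index()
)
print(mode)
```
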

Key Takeaways

  • Looking at the respondents who reported both their annual compensation and their work location, working in person at an office has a small advantage only for those with an annual compensation below $75,000;
  • Looking at those who reported both their work week hours and their work location, roughly two out of three respondents who work between 30 and 60 hours per week reported working in an office;
  • As expected, given the data seen previously for annual compensation, the most common annual compensation level for any programming language, database environment or web development framework is less than $25,000.

Conclusions

General Data

  • Over 40% of the respondents were between the ages of 20 and 30 and almost 70% were between 20 and 40 years-old;
  • Over 90% of the respondents were male, with less than 8% identifying as female. The remaining 2% identified as another gender;
  • About 43% of the respondents completed a Bachelor's Degree, but only half of those have gone on to complete a Master's Degree;
  • More than half of the respondents come from a Computer Science or Software Engineering academic background. The remaining respondents come primarily from other Engineering disciplines, Information Technologies, System Administration or Web Design and/or Development;
  • 70% of the respondents have a full-time job, with freelance and/or self-employment being the second most common employment status;
  • The five countries with most respondents are, from most to least, the United States of America, India, Germany, the United Kingdom and Canada.

Tech Stack

  • Over 60% of the respondents use JavaScript and/or HTML/CSS, over 50% use SQL and 40% use Python and/or Java;
  • Over 46% of the respondents use MySQL for database environments. On the other hand, only about 23% use PostgreSQL and/or Microsoft SQL Server, the second and third most popular environments, respectively;
  • jQuery is still the most used Web Framework, used by almost a third of respondents. React.js comes in second place (22%) and Angular(.js) in a very close third place;
  • In a group of 10 developers, about 5 reported using Visual Studio Code as one of their IDEs/Development Environments. About 3 in the group also use Visual Studio and/or Notepad++;
  • Windows is the operating system of choice for almost half of the respondents. The remaining half is fairly evenly divided between macOS and Linux-based systems, with the former taking second place.

Professional Life

  • Almost three quarters of the respondents are developers by profession. Over 10% are students and 8% write code as part of their job;
  • 44% of the respondents work between 40 and 50 hours per week. 11% work between 30 and 40 hours and 6% between 50 and 60 hours;
  • In a group of 10 people, close to 6 work at an office and 4 work remotely;
  • Technical compatibility is a deal-breaker for almost half of the respondents when looking for a job. Culture compatibility, a flexible schedule and opportunities for professional development are each almost as important, cited by about 40% of the respondents;
  • 37% of the respondents made their most recent resume update to search for a job. A change in job status came second, reported by 15%;
  • Half of the respondents reported an annual compensation of $25,000 or less. In comparison, 13% reported between $25,000 and $50,000 and 11% between $50,000 and $75,000. The remaining quarter earn between $75,000 and $150,000.

Other

  • Looking at the respondents who reported both their annual compensation and their work location, working in person at an office has a small advantage only for those with an annual compensation below $75,000;
  • Looking at those who reported both their work week hours and their work location, roughly two out of three respondents who work between 30 and 60 hours per week reported working in an office;
  • As expected, given the data seen previously for annual compensation, the most common annual compensation level for any programming language, database environment or web development framework is less than $25,000.